Adaptive gradient descent (AdaGrad) is an optimizer that dynamically adapts a separate learning rate for each parameter based on the history of partial derivatives of the loss with respect to that parameter.

As seen below, AdaGrad accumulates squared (i.e., positive) terms for each parameter. Because these accumulated terms can only grow, AdaGrad tends to decrease the effective learning rate over time until it approaches zero. RMSprop was developed to address this problem by replacing the running sum with an exponentially decaying average, and Adam has largely superseded both as it also incorporates momentum.
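For intuition, consider a simplified worked case (an illustrative assumption, using the accumulator and update defined below): if a parameter's gradient stayed at a constant value $g$ for $t$ steps, the accumulated sum would be $t\,g^2$, so the magnitude of the step would be roughly

$$ \left|\Delta\theta_t\right| \approx \frac{\eta\,|g|}{\sqrt{t\,g^2}} = \frac{\eta}{\sqrt{t}}, $$

which shrinks toward zero as $t$ grows even though the gradient itself never changes.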
To achieve this adaptation, AdaGrad maintains a state vector of accumulated squared gradients,

$$ s_t = \sum_{\tau=1}^{t} g_\tau \odot g_\tau. $$
While the sum above is technically correct, in practice the vector is updated iteratively as

$$ s_t = s_{t-1} + g_t \odot g_t. $$
AdaGrad then uses this state vector to calculate the change in parameters,

$$ \Delta\theta_t = -\frac{\eta}{\sqrt{s_t} + \epsilon} \odot g_t, $$
where $g_t = \nabla_\theta \mathcal{L}(\theta_t)$ is the gradient of the loss at step $t$, $\eta$ is the global learning rate, $\epsilon$ is a small constant added for numerical stability, and $\odot$ denotes element-wise multiplication (the division and square root are likewise applied element-wise).
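To make the update concrete, here is a minimal sketch of an AdaGrad step in NumPy that mirrors the equations above; the function name `adagrad_step` and the toy quadratic loss are illustrative choices, not part of the original text.

```python
import numpy as np

def adagrad_step(params, grads, state, lr=0.1, eps=1e-8):
    """One AdaGrad update for NumPy arrays of matching shape.

    `state` holds the running sum of squared gradients (s_t above) and is
    modified in place along with `params`.
    """
    state += grads * grads                         # s_t = s_{t-1} + g_t ⊙ g_t
    params -= lr * grads / (np.sqrt(state) + eps)  # Δθ_t = -η g_t / (√s_t + ε)
    return params, state

# Toy usage: minimise f(θ) = ½‖θ‖², whose gradient is simply θ.
theta = np.array([1.0, -2.0])
s = np.zeros_like(theta)
for t in range(100):
    grad = theta.copy()          # ∇f(θ) = θ for this toy loss
    theta, s = adagrad_step(theta, grad, s)
print(theta)  # entries shrink toward 0, but ever more slowly as s_t grows
```

Because `s` only accumulates, later steps in the loop become progressively smaller, which is exactly the decaying-learning-rate behaviour described above.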